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About Me 


• Worked on all parts of graphics and other driver stacks
• Avid Buffer Modifier
• Motivated by disparity with closed implementations
  ◦ If they can do it; damn it, we can too.
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Summarizing the Work 


Linux DRM       Linux i915      Protocol     Khronos
• blobifier     • blobifier     • DRI3.1     • image_dma_buf_import_modifiers
• AddFB2        • AddFB2        • X.org      • VK_EXT external *
• multi-plane                   • xcb

Wayland
• Mutter
• Weston
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Status 


• Many years in the making: almost there!
• Early modifier support already released: Mesa 17.2, Kernel 4.14
• Full compression support soon:
  ◦ Modesetting
  ◦ Wayland/Mutter?
  ◦ X.org/DRI3
    ▪ Mesa DRI3 support
    ▪ Mesa Wayland support
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Mountains out of Molehills? 


• Each EU needs 1GB/s bandwidth
  ◦ Texturing (trilinear, anisotropic)
  ◦ Transparency/Blending
  ◦ Antialiasing
• Display
  ◦ 3840 px * 2160 rows * 4 Bpp * 60 Hz = 1.85GB/s
  ◦ It keeps getting worse
    ▪ Increasing resolutions (5K, 8K)
    ▪ Increasing refresh rates (120 Hz, 240 Hz)
• Workloads are already memory bandwidth limited
  ◦ Can't scale up compute without more bandwidth
  ◦ Reduce visual effects
  ◦ Decrease resolution
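The display figure above is plain multiplication, so it is easy to check and to extend to the larger modes the slide mentions. The sketch below is only an illustration: the function name is made up here, the pixel dimensions for "5K" and "8K" are assumptions (the slide gives only the names), and the result is printed in binary units because that is what reproduces the slide's 1.85 figure.

```python
GIB = 1024 ** 3  # the slide's "GB/s" figure matches binary gigabytes


def scanout_bandwidth(width_px, rows, bytes_per_pixel, refresh_hz):
    """Bytes per second read from memory to scan out one full-screen plane."""
    return width_px * rows * bytes_per_pixel * refresh_hz


# The slide's case: 3840 x 2160, 4 bytes per pixel, 60 Hz.
uhd_60 = scanout_bandwidth(3840, 2160, 4, 60)

# Assumed pixel dimensions for "5K" and "8K"; the slide names them only.
modes = {
    "4K @ 60 Hz": (3840, 2160, 60),
    "4K @ 120 Hz": (3840, 2160, 120),
    "5K @ 60 Hz": (5120, 2880, 60),
    "8K @ 60 Hz": (7680, 4320, 60),
}

for name, (w, h, hz) in modes.items():
    print(f"{name}: {scanout_bandwidth(w, h, 4, hz) / GIB:.2f} GiB/s")
```

This counts a single full-screen plane only; texturing, blending and compositing traffic come on top of it.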
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Admiring the Problem 


[Figure: pipeline overview: Texture Upload -> Texel Fetch -> Framebuffer Write -> Screen]
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Display Scanout 


Open Source Technology Center


Texture Upload 


The application needs to get its assets (geometry data, texture data, precompiled shaders, etc.) into memory from storage.
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Texturing Fetch/Filtering 


Anisotropic: 16 trilinear probes (128x)
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Sampling/Writing


    #version 330

    uniform sampler2D tex;

    in vec2 texCoord;
    out vec4 fragColor;

    void main()
    {
        // Texel Fetch: read one texel from mip level 0.
        vec4 temp = texelFetch(tex, ivec2(texCoord), 0);

        // Framebuffer Write.
        fragColor = temp;
    }


[Figure: the GPU does a Texel Fetch from Texture Mipmap 1 and a Framebuffer Write to the Screen]
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Compositing 


The compositor is responsible for taking client applications' window contents and amalgamating them into a single image for display.


Like a window manager, but with offscreen buffers.


• Needs to read from the applications' rendered data, and write to the screen


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Display Engine 


Specialized, fixed function hardware which 
sources pixel data and pushes it out over some 
display protocol; possibly blending, and scaling 
the pixels along the way. 


[Figure: the display engine DMAs pixels from memory and pushes them out, with hblank and vblank intervals]
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Bytes Per Component 


Bandwidth Costs

Operation         Color Depth     Desc.             Bandwidth                R/W
Texture Upload    1Bpc (RGBX8)    File to DRAM      16KB (64 * 64 * 4)       W
Texel Fetch       1Bpc (RGBX8)    DRAM to Sampler   16KB (64 * 64 * 4)       R
FB Write          1Bpc (RGBX8)    GPU to DRAM       16KB (64 * 64 * 4)       W
Compositing       1Bpc (RGBX8)    DRAM to DRAM      32KB (64 * 64 * 4 * 2)   R+W
Display Scanout   1Bpc (RGBX8)    DRAM to PHY       16KB (64 * 64 * 4)       R


Total Bandwidth = (16 + 16 + 16 + 32 + 16) KB * 60Hz = 5.625 MB/s
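The total above can be reproduced with a few lines of arithmetic. This is a minimal sketch of the slide's model, not a measurement: the stage names and the helper function are made up for illustration, and the units are binary (1 KB = 1024 bytes), which is what yields exactly 5.625.

```python
KB = 1024
MB = 1024 * KB

# One 64 x 64 RGBX8 surface: 4 bytes per pixel.
SURFACE_BYTES = 64 * 64 * 4  # 16 KB

# Bytes moved per frame by each stage of the slide's pipeline.
stages = {
    "texture_upload": SURFACE_BYTES,   # file -> DRAM (write)
    "texel_fetch": SURFACE_BYTES,      # DRAM -> sampler (read)
    "fb_write": SURFACE_BYTES,         # GPU -> DRAM (write)
    "compositing": 2 * SURFACE_BYTES,  # DRAM -> DRAM (read + write)
    "scanout": SURFACE_BYTES,          # DRAM -> PHY (read)
}


def total_bandwidth(stage_bytes, refresh_hz=60):
    """Bytes per second if every stage runs once per displayed frame."""
    return sum(stage_bytes.values()) * refresh_hz


base = total_bandwidth(stages)
print(f"{base / MB:.3f} MB/s")
```

The model assumes every stage runs once per frame, including the texture upload, which is how the slide arrives at its total.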
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At Least it Looks Better 


Filter Mode    Multiplier (texel fetch stage)    Total Bandwidth
Nearest        1x                                5.625 MB/s
Bilinear       4x                                11.25 MB/s
Trilinear      8x                                18.75 MB/s
Aniso 4x       32x                               63.75 MB/s*
Aniso 16x      128x                              243.75 MB/s*


* Oblique angle + implementation details would reduce further




Proposed Solution: Increase Headroom 


Color Depth    Operation         Bandwidth
1Bpc           Texture Upload    16KB (64 * 64 * 4)
1Bpc           Texel Fetch       16KB (64 * 64 * 4)
1Bpc           FB Write          16KB (64 * 64 * 4)
1Bpc           Composite         32KB (64 * 64 * 4 * 2)
1Bpc           Scanout           16KB (64 * 64 * 4)


Total Bandwidth = 5.625 MB/s


Technology (~2013)                          Technology (~2016)                         Improvement
DDR3-2133, 34 GB/s (dual channel)           DDR4-3200, 51.2 GB/s (dual channel)        50%
GTX 780 (Kepler), GDDR5 288 GB/s            GTX 1080 (Pascal), GDDR5X 352 GB/s         22%
Radeon R9 290X (Hawaii), GDDR5 320 GB/s     Radeon R9 Fury (Fiji), HBM1 512 GB/s       60%
LPDDR3-1600, 12.8 GB/s (single channel)     LPDDR4-3200, 25.6 GB/s (single channel)    100%
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Proposed Solution: Hardware Composition 


Hardware is capable of having multiple hardware planes. Use them.

Color Depth     Operation         Bandwidth
1Bpc (RGBA8)    Texture Upload    16KB (64 * 64 * 4)
1Bpc (RGBA8)    Texel Fetch       16KB (64 * 64 * 4)
1Bpc (RGBA8)    FB Write          16KB (64 * 64 * 4)
1Bpc (RGBA8)    Scanout           16KB (64 * 64 * 4)


Total Bandwidth = 3.75 MB/s (33% savings) 
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Proposed Solution: Texture Compression 


• DXT1 (8:1)
• ETC1/2 (4:1)
• ASTC (variable, 6:1)


Color Depth    Operation         Bandwidth
DXT1           Texture Upload    16KB / 8
DXT1           Texel Fetch       16KB / 8
1Bpc           FB Write          16KB (64 * 64 * 4)
1Bpc           Composite         32KB (64 * 64 * 4 * 2)
1Bpc           Scanout           16KB (64 * 64 * 4)


Total Bandwidth = (2 + 2 + 16 + 32 + 16) KB * 60Hz = 3.98 MB/s (29% savings)


nvcompress -nomips -nocuda -bc1 ~/cube.png ~/cube.dds


DXT1: a block of 16 pixels (4Bpp = 64B) is encoded in 8B

0xF7C3    c0 (encoded 5:6:5)
0x0822    c1 (encoded 5:6:5)
0xA52D    c2 = 2/3 * c0 + 1/3 * c1
0x5296    c3 = 1/3 * c0 + 2/3 * c1


Uncompressed 64 x 64 texture: 4Bpp * 64 pixels * 64 rows = 16K

DXT1 64 x 64 texture: 8B per block * 16 blocks * 16 block rows = 2K
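To make the block arithmetic concrete, here is a small sketch of the DXT1 (BC1) size calculation and of how the two interpolated palette entries are derived from the two stored 5:6:5 endpoints. It is an illustration under assumptions: the function names are invented here, the interpolation is done directly on the 5:6:5 components with truncating division, and only the four-color opaque mode is handled. Real decoders typically expand the endpoints to 8 bits per channel first and may round differently.

```python
def rgb565_unpack(color):
    """Split a 16-bit 5:6:5 value into its (r, g, b) components."""
    return ((color >> 11) & 0x1F, (color >> 5) & 0x3F, color & 0x1F)


def bc1_palette(c0, c1):
    """Four-entry palette of an opaque BC1 block, in 5:6:5 component space.

    Assumption: truncating integer division; hardware rounding may differ.
    """
    p0 = rgb565_unpack(c0)
    p1 = rgb565_unpack(c1)
    p2 = tuple((2 * a + b) // 3 for a, b in zip(p0, p1))  # 2/3 c0 + 1/3 c1
    p3 = tuple((a + 2 * b) // 3 for a, b in zip(p0, p1))  # 1/3 c0 + 2/3 c1
    return [p0, p1, p2, p3]


def bc1_size(width, height):
    """Bytes for a BC1 texture: 8 bytes per 4 x 4 pixel block."""
    blocks_x = (width + 3) // 4
    blocks_y = (height + 3) // 4
    return blocks_x * blocks_y * 8


def rgba8_size(width, height):
    """Bytes for an uncompressed texture at 4 bytes per pixel."""
    return width * height * 4


palette = bc1_palette(0xF7C3, 0x0822)
print(palette)
print(rgba8_size(64, 64), bc1_size(64, 64))
```

For the 64 x 64 texture this gives 16384 bytes uncompressed against 2048 bytes in DXT1, the 8:1 ratio quoted above.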
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Problems (Increase Bandwidth) 


1. Limited by process and design 
2. Costly for manufacturing 

a. New memory modules 

b. New boards 

c. Utilizes new fabrication process 
3. May be power hungry 


Rating: Sure. Won't hold my breath 
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Problems (More Planes) 


1. Hardware specific
   a. Not all hardware can composite the same number of planes
2. Max planes is small
   a. Increasing this significantly isn't feasible (die size)

3. Only helps compositing step 


Rating: Great, doesn’t scale 
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Problems (Texture Compression) 


1. May be lossy
2. Hardware compatibility
   a. Better formats require new hardware
   b. Increased gate counts
3. Patents or proprietary
4. Misses display improvement
5. Doesn't play nicely with all filtering methods (aniso)


Rating: Great, but lacking
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Introducing E2E Lossless Compression 


• Supplement other bandwidth saving techniques
  ◦ Doesn't reduce size (in fact internally things get larger).
• Internal hardware blocks compress/decompress on the fly.
  ◦ Display
  ◦ Media
  ◦ Texture units


• Lossless
• Transparent to applications/tools
  ◦ Easier development
  ◦ Hardware improvements automatically help
• No offline compression necessary
• Compression benefits through display
• Can be huge when texturing is a small amount of bandwidth consumption


• Relatively low compression
  ◦ 2:1 max on current Intel
  ◦ 4:1 max seems to be industry standard
  ◦ Not everything will be compressible
    ▪ Will never get max.
• Limited by hardware
  ◦ However, many GPUs getting this
    ▪ UBWC (Qualcomm), AFBC (ARM), DCC (AMD), DCC (Nvidia), PVRIC (Imagination)
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Several Opportunities for Savings 


[Figure: Texture Upload -> Texel Fetch -> Framebuffer Write, annotated with "Sample & Compress" at the texture sampler]
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Semi-realistic example 


Dumb Example

Naive implementation will get 2:1 compression when a pair of cachelines has 12 or fewer colors.


Color control surface (CCS): 2b per CL pair

00 = Clear Color
01 = Use LUT w/ CL0
10 = Invalid
11 = Uncompressed

1 CCS cacheline = 32K of main framebuffer


Cacheline layout when the LUT is used:

DW0       Indices for 0-7 (4b)
DW1       Indices for 8-15 (4b)
DW2       Indices for 16-23 (4b)
DW3       Indices for 24-31 (4b)
DW4-15    Color 0-11


[Figure: main surface and color control surface; labels: 16 CL (64 rows), 32 CL (128px), UNUSED]
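The "dumb example" scheme is simple enough to sketch end to end. The code below is a hypothetical software model of it, not a description of any real hardware: it takes the 32 pixels of a cacheline pair (2 x 64B at 4Bpp), and if they use 12 or fewer distinct colors it packs them into one cacheline as four DWORDs of 4-bit indices followed by a 12-entry color table. All names are made up, and the bit order of the indices inside a DWORD is an assumption.

```python
CL_BYTES = 64
PIXELS_PER_PAIR = 2 * CL_BYTES // 4  # 32 pixels at 4 bytes per pixel
MAX_LUT_COLORS = 12                  # DW4-15 hold the colors

# 2-bit CCS states from the slide.
CCS_CLEAR, CCS_LUT, CCS_INVALID, CCS_UNCOMPRESSED = 0b00, 0b01, 0b10, 0b11


def compress_pair(pixels, clear_color=None):
    """Return (ccs_state, dwords) for one pair of cachelines."""
    assert len(pixels) == PIXELS_PER_PAIR
    if clear_color is not None and all(p == clear_color for p in pixels):
        return CCS_CLEAR, []  # nothing needs to be stored in the main surface
    lut = []
    for p in pixels:
        if p not in lut:
            lut.append(p)
    if len(lut) > MAX_LUT_COLORS:
        return CCS_UNCOMPRESSED, list(pixels)  # both cachelines, as-is
    dwords = []
    for base in range(0, PIXELS_PER_PAIR, 8):  # DW0-3: eight 4-bit indices each
        dw = 0
        for i, p in enumerate(pixels[base:base + 8]):
            dw |= lut.index(p) << (4 * i)      # assumed: lowest nibble first
        dwords.append(dw)
    dwords += lut + [0] * (MAX_LUT_COLORS - len(lut))  # DW4-15: the colors
    return CCS_LUT, dwords


def decompress_pair(state, dwords, clear_color=0):
    """Rebuild the 32 pixels of a pair from its CCS state and stored DWORDs."""
    if state == CCS_CLEAR:
        return [clear_color] * PIXELS_PER_PAIR
    if state == CCS_UNCOMPRESSED:
        return list(dwords)
    if state != CCS_LUT:
        raise ValueError("invalid CCS state")
    lut = dwords[4:16]
    return [lut[(dwords[i // 8] >> (4 * (i % 8))) & 0xF]
            for i in range(PIXELS_PER_PAIR)]


def ccs_cacheline_coverage():
    """Main-surface bytes described by one 64B CCS cacheline (2 bits per pair)."""
    return (CL_BYTES * 8 // 2) * (2 * CL_BYTES)


two_tone = [0xFF0000FF if i % 2 else 0xFF00FF00 for i in range(PIXELS_PER_PAIR)]
state, packed = compress_pair(two_tone)
print(state == CCS_LUT, len(packed) * 4, "bytes instead of", 2 * CL_BYTES)
```

The stored surface keeps its full allocation either way; only the number of cachelines that have to be read or written changes, which is why this saves bandwidth rather than memory.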
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E2E Bandwidth Savings (2:1 compression) 


Operation         Color Depth         Desc.             Bandwidth                    R/W
Texture Upload    1Bpc (RGBX8)        File to DRAM      16KB (64 * 64 * 4)           W
Texel Fetch       1Bpc (RGBX8)        DRAM to Sampler   16KB (64 * 64 * 4)           R
FB Write          Compressed (2:1)    GPU to DRAM       16KB (64 * 64 * 4) / 2       W
Compositing       Compressed (2:1)    DRAM to DRAM      32KB (64 * 64 * 4 * 2) / 2   R+W
Display Scanout   Compressed (2:1)    DRAM to PHY       16KB (64 * 64 * 4) / 2       R


Total Bandwidth = (16 + 16 + 8 + 16 + 8) * 60Hz = 3.75 MB/s (33%) 
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Molehills out of Mountains! 


Technique             Bandwidth     BW Savings    KB per frame
Base                  5.625 MB/s    -             16 + 16 + 16 + 32 + 16
+ HW compositing      3.75 MB/s     33%           16 + 16 + 16 + 16
+ DXT1 Compression    2.11 MB/s     62%           2 + 2 + 16 + 16
+ E2E Compression     1.17 MB/s     79%           2 + 2 + 8 + 8


Disk Savings 
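The cumulative figures can be checked by applying the three techniques one after another to the per-frame byte counts. The sketch below is only that check: the stage names and helper functions are invented for illustration, hardware compositing is modeled as removing the compositing pass entirely, DXT1 as an 8:1 reduction of upload and fetch, and E2E compression as a 2:1 reduction of framebuffer write and scanout.

```python
KB = 1024
MB = KB * KB

# KB moved per frame by each stage in the uncompressed pipeline.
BASE = {"upload": 16, "fetch": 16, "fb_write": 16, "composite": 32, "scanout": 16}


def mb_per_s(stages, refresh_hz=60):
    """Total bandwidth in MB/s (binary units, as on the slides)."""
    return sum(stages.values()) * KB * refresh_hz / MB


def hw_compositing(stages):
    """Hardware planes remove the compositing read + write."""
    out = dict(stages)
    out.pop("composite", None)
    return out


def dxt1(stages):
    """DXT1 at 8:1 shrinks texture upload and texel fetch."""
    out = dict(stages)
    out["upload"] /= 8
    out["fetch"] /= 8
    return out


def e2e(stages):
    """2:1 lossless compression on everything after rendering."""
    out = dict(stages)
    for stage in ("fb_write", "composite", "scanout"):
        if stage in out:
            out[stage] /= 2
    return out


base_bw = mb_per_s(BASE)
current = BASE
results = {}
for name, technique in [("HW compositing", hw_compositing),
                        ("DXT1", dxt1),
                        ("E2E", e2e)]:
    current = technique(current)
    bw = mb_per_s(current)
    results[name] = (bw, 100 * (1 - bw / base_bw))
    print(f"+ {name}: {bw:.2f} MB/s, {results[name][1]:.1f}% saved")
```

Each technique is applied on top of the previous ones, which is how the table reaches 1.17 MB/s.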
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Intermission 


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Implementation Challenges 


1. Currently, everything treats a framebuffer as a buffer of pixels.
   a. The main buffer is no longer just pixel data.
   b. There's another buffer! (similar to planar formats)
2. Buffer allocation, buffer import/export, and display server protocol need to be made aware of this.
3. Applications and compositors cannot rely on compression working everywhere.
   a. Ex. Skylake doesn't allow compression on pipe C
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Several Solutions 


1. Encode "modifiers" in fourcc format
   a. V4L does this (include/uapi/linux/videodev2.h)
   b. Works well for entirely proprietary formats
   c. Concern about amount of bits for modifiers in DRM
      i. Graphics formats combinatorially explode faster [apparently]
      ii. Even 64b modifier was questioned
   d. Never really considered (not sure why)
2. Intel specific plane property (original proposal)
   a. Many other drivers shared similar problem.
   b. KMS clients wanted a hardware agnostic mechanism
   c. Protocol still required anyway
3. dma-buf metadata
   a. Just a get/set IOCTL for adding modifiers to a dma-buf
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The Result = Modifiers


commit e3eb3250d84ef97b766312345774367b6a310db8
Author: Rob Clark <robdclark@gmail.com>
Date: Thu Feb 5 14:41:52 2015 +0000

    drm: add support for tiled/compressed/etc modifier in addfb2


• Some support already landed
• Describes modifications to a buffer's layout
• Easy to add new modifiers to support different tiling formats
• Missing some key pieces
  ◦ Query interface
  ◦ Protocol
  ◦ Driver implementation
• Compression somewhat muddies the definition


[Screenshot: drm/drm_mode.h on the drm-intel-next-queued branch]
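A modifier is a 64-bit value with a vendor ID in the top 8 bits and a vendor-defined code in the remaining 56. The sketch below mirrors, in Python, the `fourcc_mod_code` macro and a few Intel values as I understand them from the kernel's `drm_fourcc.h`; treat the constants as illustrative and check the header of the kernel you target. The `vendor_of` helper is made up for this example.

```python
DRM_FORMAT_MOD_VENDOR_NONE = 0x00
DRM_FORMAT_MOD_VENDOR_INTEL = 0x01


def fourcc_mod_code(vendor, val):
    """Pack an 8-bit vendor ID and a 56-bit vendor-specific code."""
    return (vendor << 56) | (val & 0x00FFFFFFFFFFFFFF)


def vendor_of(modifier):
    """Hypothetical helper: recover the vendor ID from a modifier."""
    return modifier >> 56


DRM_FORMAT_MOD_LINEAR = fourcc_mod_code(DRM_FORMAT_MOD_VENDOR_NONE, 0)

# Intel tiling layouts; the *_CCS variants add the color control surface.
I915_FORMAT_MOD_X_TILED = fourcc_mod_code(DRM_FORMAT_MOD_VENDOR_INTEL, 1)
I915_FORMAT_MOD_Y_TILED = fourcc_mod_code(DRM_FORMAT_MOD_VENDOR_INTEL, 2)
I915_FORMAT_MOD_Yf_TILED = fourcc_mod_code(DRM_FORMAT_MOD_VENDOR_INTEL, 3)
I915_FORMAT_MOD_Y_TILED_CCS = fourcc_mod_code(DRM_FORMAT_MOD_VENDOR_INTEL, 4)
I915_FORMAT_MOD_Yf_TILED_CCS = fourcc_mod_code(DRM_FORMAT_MOD_VENDOR_INTEL, 5)

for name in ("DRM_FORMAT_MOD_LINEAR", "I915_FORMAT_MOD_X_TILED",
             "I915_FORMAT_MOD_Y_TILED_CCS"):
    print(f"{name} = {globals()[name]:#018x}")
```

Because the value is opaque outside the owning driver, a compositor can pass modifiers between the kernel, EGL and the display protocol without understanding what any of them mean.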
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Step 1: Compositor Negotiation 


Query all "sink" APIs to find out what modifiers are supported for the given format and hardware.


[Figure: the compositor queries the kernel (drm_crtc, drm_plane) via drmModeGetConnector, drmModeGetProperty and drmModeGetPropertyBlob, and queries EGL via eglQueryDmaBufModifiersEXT]
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Queries 


• Blobifier (KMS blob property for drm plane)
  ◦ What modifiers does the plane support?
• EGL extensions
  ◦ EXT_image_dma_buf_import_modifiers
    ▪ eglQueryDmaBufModifiersEXT
      • What modifiers does my format support?
      • "is used to query the dma_buf format modifiers supported by <dpy> for the given format."
• Vulkan/WSI (WIP)
  ◦ VK_MESAX_external_image_dma_buf


Plumbers: Collabora (funded by Intel and Google), Google, Intel
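Once each sink has reported its list, the compositor needs the modifiers that every sink it cares about can handle. One way to picture that step is a set intersection ordered by preference, sketched below. Everything here is hypothetical: the function, the string names standing in for 64-bit modifier values, and the per-sink lists, which in a real compositor would come from the plane's blob property and from `eglQueryDmaBufModifiersEXT`.

```python
def negotiate(sink_modifiers, preference):
    """Modifiers supported by every sink, best first.

    sink_modifiers maps a sink name to the modifiers it reported;
    preference lists all known modifiers from most to least desirable.
    """
    if not sink_modifiers:
        return []
    common = set.intersection(*(set(mods) for mods in sink_modifiers.values()))
    return [mod for mod in preference if mod in common]


# Hypothetical preference order: compressed first, linear as the last resort.
PREFERENCE = ["Y_TILED_CCS", "Y_TILED", "X_TILED", "LINEAR"]

# Made-up query results for a plane that can scan out compressed buffers...
sinks_pipe_a = {
    "kms_plane": ["LINEAR", "X_TILED", "Y_TILED", "Y_TILED_CCS"],
    "egl_import": ["LINEAR", "X_TILED", "Y_TILED", "Y_TILED_CCS"],
}
# ...and for one that cannot.
sinks_pipe_c = {
    "kms_plane": ["LINEAR", "X_TILED", "Y_TILED"],
    "egl_import": ["LINEAR", "X_TILED", "Y_TILED", "Y_TILED_CCS"],
}

usable_a = negotiate(sinks_pipe_a, PREFERENCE)
usable_c = negotiate(sinks_pipe_c, PREFERENCE)
print(usable_a)
print(usable_c)
```

The second case shows why clients cannot rely on compression everywhere: the same buffer format ends up with a different usable list depending on which plane it will be shown on.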
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Step 2: Take That and Shove It 


down your protocol pipe 


With the optimal modifiers in hand, some protocol will tell the client which modifiers it 
might want to use. 


[Figure: modifiers which can be used for direct scanout, and used to sample from EGLImages, are sent over protocol, i.e. DRI3.1, Wayland, kmscube]
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Protocols


• Wayland
  ◦ "zwp_linux_buffer_params_v1" version="3"
• DRI3.1
  ◦ Multi-plane support
  ◦ XDRI3GetSupportedModifiers


Plumbers: Collabora (paid for by Intel) 
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Step 3: Making BOs 


Next, the client creates the buffer either directly, or indirectly with the formats and modifiers it desires.


[Figure: the client (could be a different EGL!) creates the BO via eglCreateWindowSurface and eglSwapBuffers, or via gbm_surface_create_with_modifiers and gbm_bo_get_modifier]
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Buffer Creation 


• EGL
  ◦ eglCreateWindowSurface
    ▪ Wayland
    ▪ X11
      • (Mesa) Ask over DRI3.1 what's supported
      • (Mesa) Call into DRI driver to create an image
• GBM
  ◦ gbm_*_create_with_modifiers
• DRIimage
  ◦ createImageWithModifiers (made for GBM)
  ◦ createImageFromDmaBufs2 (made for DRI 3.1)
e Vulkan/WSI (WIP) 


Plumbers: Collabora, Google, Intel 
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The Whole Story Thus Far 


1. The KMS client, AKA the compositor, will query the kernel via KMS APIs to find out what modifiers are supported for the formats it wishes to use on the sinks it wishes to use. To do this, first the KMS client will use the connector found via drmModeGetConnector(), and get a random CRTC from the connector. Next, the primary plane can be found with drmModeGetProperty(), and finally, the modifiers come in as a blob via drmModeGetPropertyBlob().


1b. The compositor also queries other sink APIs, e.g. EGL for texturing from the buffer, or an encoder for screen capture.


2. The final list of supported format/modifier combinations is sent to the DRI client over the display server protocol, or some other pre-defined protocol when not using normal client/server display server protocols.


3. The DRI client may query the supported modifiers it wishes to use based on the format it wants to render to, and creates the buffer. With EGL, this will normally be handled by the EGL winsys layer. It could also be done explicitly via GBM, which a compositor or very low-level application may do. If it doesn't do this, no modifiers will be used.


[Figure: the kernel (drm_crtc, drm_plane) is queried via drmModeGetConnector, drmModeGetProperty and drmModeGetPropertyBlob, and EGL via eglQueryDmaBufModifiersEXT; modifiers which can be used for direct scanout, and used to sample from EGLImages, are sent over protocol (i.e. DRI3.1, Wayland, kmscube); the client, which could be a different EGL, uses eglCreateWindowSurface, eglSwapBuffers, gbm_surface_create_with_modifiers and gbm_bo_get_modifier]
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Display it 


1. Software Compositing (Option)
   a. EGL_EXT_image_dma_buf_import_modifiers
   b. Much work required
2. Hardware compositing (Option)
   a. drmModeAddFB2WithModifiers
   b. Relatively minor changes required. AddFB2 already supported modifiers
      i. Add new modifiers to drm_fourcc.h
      ii. Added error checking when modifiers change plane count.
      iii. Driver specific handling of modifiers.
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He'll Flip You (for real) 


Option 1: Create EGLImage then composite via texturing (dma-buf -> EGLImage)


Option 2: Create drm_framebuffer from client's BO then flip directly (dma-buf -> drm_framebuffer)


[Figure: in option 1 the client's buffer is textured into some other BO that is shown on a HW plane; in option 2 the client's buffer goes straight to the HW planes]
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Preliminary Results 


"Benchmark"    Original     CCS         % improved
kmscube        1.22 GB/s    600 MB/s    51 (2x)
glxgears       1775 FPS     3900 FPS    54 (2.2x)
TRex                                    2.3 (.02x)
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Takeaways 


• Memory bandwidth requirements for graphics workloads can be astronomical.
• Don't assume texture compression is the end of the bandwidth story.
• Modifiers "modify" the framebuffer's pixel layout.
• Lossless compression reduces bandwidth, not size
  ◦ Many GPUs support this transparently
• Hardware compositing is great.
• Getting features like this plumbed through can easily be a multi-year effort.
• Haiku isn't supported :/
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Thank Yous 


Platinum Level
  Kristian Hogsberg, Google
  Daniel Stone, Collabora

Gold Level
  Rob Clark, Red Hat
  Jason Ekstrand, Intel
  Ville Syrjälä, Intel
  Daniel Vetter, Intel

  Liviu Dudau, Arm Ltd
  Eric Engestrom, Imagination Technologies
  Varad Gautam, Collabora
  Topi Pohjolainen, Intel
  Lucas Stach, Pengutronix
  Emil Velikov, Collabora
  Chad Versace, Google
  Tomeu Vizoso, Collabora
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Q&A 
(not about EGLStreams) 


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